
Conversation

@wine99 (Collaborator) commented Nov 3, 2025

Main Changes

  1. Use a stateless graph to fix llama-cli, llama-server, and llama-bench.
  2. Use a single static graph on NPU for both prompt processing and decoding (see the sketch after this list).
    The limitation is that NPU must run with -ub 1 across all utilities, which makes prompt-processing time proportional to the input length.
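To make point 2 concrete, here is a minimal, illustrative C++ sketch (not this PR's actual code) of how one statically shaped OpenVINO graph can serve both prompt processing and decoding when the ubatch size is 1: every step feeds exactly one token, so the same compiled model is reused for each position, and prompt processing becomes one inference per prompt token. The model path, tensor names, and token ids below are assumptions for illustration; KV-cache inputs/outputs are omitted for brevity.

```cpp
// Illustrative sketch only: one statically shaped graph, one token per step.
// "llama_static.xml", "input_ids", and "logits" are hypothetical names, not from this PR.
#include <openvino/openvino.hpp>
#include <cstdint>
#include <vector>

int main() {
    ov::Core core;
    auto model    = core.read_model("llama_static.xml");   // hypothetical converted model
    auto compiled = core.compile_model(model, "NPU");       // compiled once, static shapes
    ov::InferRequest req = compiled.create_infer_request();

    std::vector<int64_t> prompt = {1, 15043, 29991};        // dummy token ids

    // Prompt processing: with -ub 1 each prompt token is its own inference,
    // so prompt-processing time grows linearly with the prompt length.
    for (int64_t tok : prompt) {
        ov::Tensor input(ov::element::i64, ov::Shape{1, 1});
        input.data<int64_t>()[0] = tok;
        req.set_tensor("input_ids", input);
        req.infer();
    }

    // Decoding reuses exactly the same compiled graph: again one token per inference.
    ov::Tensor logits = req.get_tensor("logits");
    // ... sample the next token from logits and feed it back through the same loop ...
    return 0;
}
```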

Preliminary Test

For CPU and GPU:

  • llama-simple, llama-cli, and llama-server work with default command-line arguments.
  • llama-bench needs to be run with the flag -fa 1.

For NPU:

  • llama-cli and llama-server work with -ub 1. For better performance, a smaller context size is recommended (e.g., -c 512).
  • llama-simple does not work as it does not support setting -ub.
  • llama-bench needs to be run with -fa 1 -ub 1. A shorter prompt and generation length (e.g., -p 32 -n 32) are also recommended for faster results; an example invocation is shown just below.
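For reference, an NPU llama-bench run combining the flags above might look like the following (the model path simply reuses the one from the runs below):

GGML_OPENVINO_DEVICE=NPU ./llama-bench -m Llama-3.2-1B-Instruct.q4_0.gguf -fa 1 -ub 1 -p 32 -n 32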

Running llama-cli on LNL-32GB-Linux:

GGML_OPENVINO_DEVICE=CPU ./llama-cli -m Llama-3.2-1B-Instruct.q4_0.gguf -c 512

> Hi
Hello! How can I help you today?

> EOF by user

llama_perf_sampler_print:    sampling time =       0.58 ms /    20 runs   (    0.03 ms per token, 34482.76 tokens per second)
llama_perf_context_print:        load time =    4290.94 ms
llama_perf_context_print: prompt eval time =      83.88 ms /    11 tokens (    7.63 ms per token,   131.14 tokens per second)
llama_perf_context_print:        eval time =     175.56 ms /     9 runs   (   19.51 ms per token,    51.27 tokens per second)
llama_perf_context_print:       total time =    2283.60 ms /    20 tokens

GGML_OPENVINO_DEVICE=GPU ./llama-cli -m Llama-3.2-1B-Instruct.q4_0.gguf -c 512

> Hi
Hello! How can I help or chat with you today?

> EOF by user

llama_perf_sampler_print:    sampling time =       0.85 ms /    23 runs   (    0.04 ms per token, 27186.76 tokens per second)
llama_perf_context_print:        load time =    5169.48 ms
llama_perf_context_print: prompt eval time =     101.62 ms /    11 tokens (    9.24 ms per token,   108.25 tokens per second)
llama_perf_context_print:        eval time =     294.33 ms /    12 runs   (   24.53 ms per token,    40.77 tokens per second)
llama_perf_context_print:       total time =    2225.93 ms /    23 tokens

GGML_OPENVINO_DEVICE=NPU ./llama-cli -m Llama-3.2-1B-Instruct.q4_0.gguf -c 512 -ub 1

> Hi
Hello! How can I assist you today?

> EOF by user


llama_perf_sampler_print:    sampling time =       2.02 ms /    20 runs   (    0.10 ms per token,  9920.63 tokens per second)
llama_perf_context_print:        load time =    9096.21 ms
llama_perf_context_print: prompt eval time =     603.17 ms /    11 tokens (   54.83 ms per token,    18.24 tokens per second)
llama_perf_context_print:        eval time =     403.52 ms /     9 runs   (   44.84 ms per token,    22.30 tokens per second)
llama_perf_context_print:       total time =   11608.82 ms /    20 tokens

github-actions bot added the ggml label Nov 3, 2025
wine99 merged commit 2a51cde into dev_backend_openvino Nov 4, 2025
1 check passed
wine99 added a commit that referenced this pull request Nov 4, 2025
* Stateless. Fix llama-cli llama-server

* Simplify broadcast op in attention

* Replace get_output_tensor+memcpy with set_output_tensor

* NPU unify PD. Unify dynamic and static dims
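The "Replace get_output_tensor+memcpy with set_output_tensor" commit follows a standard OpenVINO pattern: instead of letting the runtime allocate the output tensor and copying it into the backend's buffer afterwards, a tensor wrapping the destination buffer is registered as the output so inference writes into it directly. The sketch below illustrates that pattern in isolation; the destination buffer, element type, and sizes are assumptions, not the PR's actual code.

```cpp
// Illustrative before/after for writing inference output into a caller-owned buffer.
// The float destination buffer and its size are hypothetical.
#include <openvino/openvino.hpp>
#include <cstring>

// Before: run inference, then copy the runtime-owned output into dst.
void run_with_copy(ov::InferRequest& req, float* dst, size_t count) {
    req.infer();
    ov::Tensor out = req.get_output_tensor(0);
    std::memcpy(dst, out.data<float>(), count * sizeof(float));
}

// After: wrap dst in an ov::Tensor and register it as the output,
// so inference writes into dst directly and the extra memcpy disappears.
void run_without_copy(ov::InferRequest& req, float* dst, size_t count) {
    ov::Tensor out(ov::element::f32, ov::Shape{1, count}, dst);
    req.set_output_tensor(0, out);
    req.infer();
}
```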